Video to Text Summary: Joint Video Summarization and Captioning with Recurrent Neural Networks
Abstract
Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network configuration that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames. On the other hand, caption signals can help a video summarization model to learn better semantic representations. Jointly modeling both the video summarization and the video captioning tasks offers a novel end-to-end solution that generates a captioned video summary enabling users to index and navigate through the highlights in a video. Moreover, our experiments show the joint model can achieve better performance than state-of-the-art approaches in both individual tasks.
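The idea of training on two supervisory signals at once can be illustrated with a minimal sketch of a weighted joint objective. This is not the authors' implementation: the abstract does not specify the loss, so the function names, the choice of binary cross-entropy for the summary signal, negative log-likelihood for the caption signal, and the `summary_weight` parameter are all assumptions made for illustration.

```python
import math


def summary_loss(frame_scores, summary_labels, eps=1e-12):
    """Mean binary cross-entropy between predicted frame-importance
    scores (in [0, 1]) and 0/1 labels marking summary frames.
    Assumed form of the summary signal; not taken from the paper."""
    total = 0.0
    for p, t in zip(frame_scores, summary_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(frame_scores)


def caption_loss(word_probs, eps=1e-12):
    """Mean negative log-likelihood of the ground-truth caption words,
    given the probability the model assigned to each one.
    Assumed form of the caption signal; not taken from the paper."""
    return -sum(math.log(max(p, eps)) for p in word_probs) / len(word_probs)


def joint_loss(frame_scores, summary_labels, word_probs, summary_weight=0.5):
    """Weighted sum of the two losses. `summary_weight` is a hypothetical
    hyperparameter that trades off the two supervisory signals."""
    return (summary_weight * summary_loss(frame_scores, summary_labels)
            + (1.0 - summary_weight) * caption_loss(word_probs))


if __name__ == "__main__":
    # Toy values: four frames, two of them labeled as summary frames,
    # and a three-word caption.
    scores = [0.9, 0.2, 0.8, 0.1]
    labels = [1, 0, 1, 0]
    probs = [0.7, 0.6, 0.9]
    print(joint_loss(scores, labels, probs))
```

In a setup like this, gradients from both terms would flow into a shared video encoder, which is one way the summary signal could steer the captioner toward important frames and the caption signal could enrich the representations used for summarization.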
Related resources
Multimodal Memory Modelling for Video Captioning
Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is st...
End-to-End Video Captioning with Multitask Reinforcement Learning
Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memories) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, those limitations of E2E learning are especially amplified by the fact that both the input v...
Text Generation using Generative Adversarial Training
Generative models reduce the need for laborious dataset labeling. Text generation techniques can be applied to improve language models, machine translation, summarization, and captioning. This project experiments with different recurrent neural network models to build generative adversarial networks for generating text from noise. The trained generator is capable of producing...
A survey on Automatic Text Summarization
Text summarization endeavors to produce a condensed version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data in order to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
Automatic Video Captioning via Multi-channel Sequential Encoding
In this paper, we propose a novel two-stage video captioning framework composed of 1) a multi-channel video encoder and 2) a sentence-generating language decoder. Both the encoder and the decoder are based on recurrent neural networks with long short-term memory (LSTM) cells. Our system can take videos of arbitrary lengths as input. Compared with the previous sequence-to-sequence video captioning frame...